Solving Data Imbalance in Text Classification with Constructing Contrastive Samples
Abstract
Contrastive learning (CL) has been successfully applied in Natural Language Processing (NLP) as a powerful representation learning method and has shown promising results in various downstream tasks. Recent research has highlighted the importance of constructing effective contrastive samples through data augmentation. However, current data augmentation methods primarily rely on random word deletion, substitution, and cropping, which may introduce noisy samples and hinder representation learning. In this article, we propose a novel approach to address data imbalance in text classification by constructing contrastive samples. Our method involves the use of a Label-indicative Component to generate high-quality positive samples for the minority class, along with the introduction of a Hard Negative Mixing strategy to synthesize challenging negative samples at the feature level. By applying supervised contrastive learning to these samples, we are able to obtain superior text representations, which significantly benefit text classification tasks with imbalanced data. Our approach effectively mitigates distributional biases and promotes noise-resistant representation learning. To validate the effectiveness of our method, we conducted experiments on benchmark datasets (THUCNews, AG’s News, 20NG) as well as the FDCNews dataset. The code is publicly available at the following GitHub repository: https://github.com/hanggun/CLDMTC.
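The abstract names two feature-level ingredients: mixing hard negatives and a supervised contrastive objective. The sketch below is a generic illustration of both ideas in plain Python, not the authors' implementation (see the linked repository for that). The function names, the choice of mixing the two hardest negatives, and the values of `alpha` and `temperature` are all assumptions made for illustration; the loss follows the widely used supervised contrastive form, averaged over anchors that have at least one positive.

```python
import math
from typing import List, Sequence

Vector = List[float]


def normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mix_hard_negatives(anchor: Sequence[float],
                       negatives: Sequence[Sequence[float]],
                       alpha: float = 0.5) -> Vector:
    """Synthesize one negative from the two negatives closest to the anchor.

    Assumption: "hard" means highest cosine similarity to the anchor, and
    mixing is a convex combination followed by re-normalization.
    """
    if len(negatives) < 2:
        raise ValueError("need at least two negatives to mix")
    a = normalize(anchor)
    ranked = sorted((normalize(n) for n in negatives),
                    key=lambda n: dot(a, n), reverse=True)
    h1, h2 = ranked[0], ranked[1]
    mixed = [alpha * x + (1.0 - alpha) * y for x, y in zip(h1, h2)]
    return normalize(mixed)


def supcon_loss(features: Sequence[Sequence[float]],
                labels: Sequence[int],
                temperature: float = 0.1) -> float:
    """Supervised contrastive loss over one batch of feature vectors.

    Samples sharing a label are positives for each other; every other
    sample in the batch appears in the denominator.
    """
    z = [normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n) if p != i and labels[p] == labels[i]]
        if not positives:
            continue  # anchors without positives contribute nothing
        logits = {a: dot(z[i], z[a]) / temperature for a in range(n) if a != i}
        log_denom = math.log(sum(math.exp(v) for v in logits.values()))
        total += -sum(logits[p] - log_denom for p in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


if __name__ == "__main__":
    # Toy 2-D features: two samples per class.
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print("loss:", supcon_loss(feats, [0, 0, 1, 1]))
    print("mixed negative:", mix_hard_negatives(feats[0], feats[2:] + [[0.5, 0.5]]))
```

In a training loop, a synthesized negative like this would be appended to the anchor's negative set before the loss is computed. How the paper selects, weights, and uses such samples is not stated in the abstract.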
Similar resources
Effectively Constructing Reliable Data for Cross-Domain Text Classification
Traditional classification algorithms often fail when the independent and identically distributed (i.i.d.) assumption does not hold, and cross-domain learning has recently emerged to deal with this problem. We observe that although the model trained on the training data may not perform well over all test data, it can give much better prediction results on a subset of the test data with h...
The Clustering and Classification Data Mining Techniques in Insurance Fraud Detection: The Case of Iranian Car Insurance
Given the ever-increasing spread of fraud in the insurance sector, particularly in automobile insurance, and its negative consequences for insurance companies, applying suitable and efficient methods to identify and detect fraud in this area is essential. Understanding the patterns present in data on previously reported claims can help determine whether a damage claim is genuine or not. One of the most common and widely used ways of discovering patterns in data is the use of ...
Improving classification of mature microRNA by solving class imbalance problem
MicroRNAs (miRNAs) are non-coding RNAs of ~20-25 nucleotides that regulate gene expression at the post-transcriptional level. The accuracy of identifying the start site of mature miRNA from a given pre-miRNA remains low. It is worth noting that mature miRNA prediction is a class-imbalanced problem, which also leads to the unsatisfactory performance of these methods. We improved the predictio...
A Novel Field Learning Algorithm for Dual Imbalance Text Classification
The Fish-net algorithm is a novel field learning algorithm that derives classification rules by looking at the range of values of each attribute instead of the individual point values. In this paper, we present a Feature Selection Fish-net learning algorithm to solve the Dual Imbalance problem in text classification. Dual imbalance includes instance imbalance and feature imbalance. The instanc...
A Survey on Methods for Solving Data Imbalance Problem for Classification
The term “data imbalance” in classification refers to a well-established phenomenon in which a data set contains unbalanced class distributions. A data set is called unbalanced if it contains at least one class that is represented by very few examples. A range of solutions have been proposed for the problem of data imbalance, including data sampling, cost evaluation of model, bagging, boosting, Genetic Progr...
Journal
Journal title: IEEE Access
Year: 2023
ISSN: 2169-3536
DOI: https://doi.org/10.1109/access.2023.3306805